METHOD AND SYSTEM FOR LOW-OVERHEAD
MEASUREMENT OF PER-THREAD
PERFORMANCE INFORMATION
IN A MULTITHREADED ENVIRONMENT



BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to an improved data pro- 
cessing system and, in particular, to a method and apparatus 
for optimizing performance in a data processing system. 
Still more particularly, the present invention provides a 
method and apparatus for a software program development 
tool for enhancing performance of a software program 
through performance profiling. 

2. Description of Related Art 

Effective management and enhancement of data process- 
ing systems requires knowing how and when various system 
resources are being used. In analyzing and enhancing per- 
formance of a data processing system and the applications 
executing within the data processing system, it is helpful to 
know which software modules within a data processing 
system are using system resources. Performance tools are 
used to monitor and examine a data processing system to 
determine resource consumption as various software appli- 
cations are executing within the data processing system. For 
example, a performance tool may identify the most fre- 
quently executed modules and instructions in a data pro- 
cessing system, may identify those modules which allocate 
the largest amount of memory, or may identify those mod- 
ules which perform the most I/O requests. Hardware-based 
performance tools may be built into the system and, in some 
cases, may be installed at a later time, while software-based 
performance tools may generally be added to a data pro- 
cessing system at any time. 

In order to improve performance of program code, it is 
often necessary to determine how time is spent by the 
processor in executing code, such efforts being commonly 
known in the computer processing arts as locating "hot 
spots." Ideally, one would like to isolate such hot spots at 
various levels of granularity in order to focus attention on 
code which would most benefit from improvements. 

For example, isolating such hot spots to the instruction 
level permits compiler developers to find significant areas of 
suboptimal code generation at which they may focus their 
efforts to improve code generation efficiency. Another poten- 
tial use of instruction level detail is to provide guidance to 
the designer of future systems. Such designers employ 
profiling tools to find characteristic code sequences and/or 
single instructions that require optimization for the available 
software for a given type of hardware. 

Most software engineers are more concerned about the 
efficiency of applications at higher levels of granularity, such 
as the source code statement level or source code module 
level. For example, if a software engineer can determine that 
a particular module requires significant amounts of time to 
execute, the software engineer can make an effort to increase 
the performance of that particular module. In addition, if a 
software engineer can determine that a particular module is 
sometimes invoked unnecessarily, then the software engi- 
neer can rewrite other portions of code to eliminate unnec- 
essary module executions. 

Various hardware-based performance tools are available
to professional software developers. Within state-of-the-art
processors, facilities are often provided which enable the
processor to count occurrences of software-selectable events
and to time the execution of processes within an associated
data processing system. Collectively, these facilities are
generally known as the performance monitor of the processor.
Performance monitoring is often used to optimize the
use of software in a system.

A performance monitor is generally regarded as a facility
incorporated into a processor to monitor selected operational
characteristics of a hardware environment to assist in the
debugging and analyzing of systems by determining a
machine's state at a particular point in time or over a period
of time. Often, the performance monitor produces information
relating to the utilization of a processor's instruction
execution and storage control. For example, the performance
monitor can be utilized to provide information regarding the
amount of time that has passed between events in a processing
system. As another example, software engineers
may utilize timing data from the performance monitor to
optimize programs by relocating branch instructions and
memory accesses. In addition, the performance monitor may
be utilized to gather data about the access times to the data
processing system's L1 cache, L2 cache, and main memory.
Utilizing this data, system designers may identify performance
bottlenecks specific to particular software or hardware
environments. The information produced usually
guides system designers toward ways of enhancing performance
of a given system or of developing improvements in
the design of a new system.

To obtain performance information, events within the data
processing system are counted by one or more counters
within the performance monitor. The operation of such
counters is managed by control registers, which comprise
a plurality of bit fields. In general, both control
registers and the counters are readable and writable by
software. Thus, by writing values to the control register, a
user may select the events within the data processing system
to be monitored and specify the conditions under which the
counters are enabled.
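
As a rough illustration of how software might compose a control register value from bit fields, the following Python sketch packs and unpacks fields of a 32-bit word. The field names, the helper functions, and the assumption that bit 0 is the most significant bit are illustrative only; the field positions loosely follow the example layout described for FIG. 2B later in this document and do not describe any particular processor.

```python
# Hypothetical 32-bit control register layout, loosely following the
# FIG. 2B example; bit 0 is ASSUMED to be the most significant bit.
REGISTER_WIDTH = 32

# (first_bit, last_bit) per field, inclusive. Field names are illustrative.
FIELDS = {
    "freeze": (0, 0),         # when 1, counters stop changing
    "pmc1_select": (19, 25),  # event selector for PMC1
    "pmc2_select": (26, 31),  # event selector for PMC2
}


def _shift_and_mask(first_bit, last_bit):
    """Convert a big-endian bit range into a right shift and a value mask."""
    width = last_bit - first_bit + 1
    shift = REGISTER_WIDTH - 1 - last_bit
    return shift, (1 << width) - 1


def set_field(register, name, value):
    """Return a copy of register with the named field replaced by value."""
    first_bit, last_bit = FIELDS[name]
    shift, mask = _shift_and_mask(first_bit, last_bit)
    if value < 0 or value > mask:
        raise ValueError(f"{name} does not fit in {last_bit - first_bit + 1} bits")
    return (register & ~(mask << shift) & 0xFFFFFFFF) | (value << shift)


def get_field(register, name):
    """Extract the named field from register."""
    shift, mask = _shift_and_mask(*FIELDS[name])
    return (register >> shift) & mask


# Select (made-up) event codes 5 and 9 for the two counters.
control = set_field(0, "pmc1_select", 5)
control = set_field(control, "pmc2_select", 9)
```

On real hardware the resulting value would be written to the control register by privileged code; here it is only an integer.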

In addition to the hardware-based performance tools and
techniques discussed above, software-based performance
tools may also be deployed in a variety of manners. One
known software-based performance tool is a trace tool. A
trace tool may use more than one technique to provide trace
information that indicates execution flows for an executing
application. One technique keeps track of particular
sequences of instructions by logging certain events as they
occur, a so-called event-based profiling technique. For
example, a trace tool may log every entry and corresponding
exit into and from a module, subroutine, method, function,
or system component within the executing application.
Typically, a time-stamped record is produced for each such
event. Corresponding pairs of records similar to entry-exit
records may also be used to trace execution of arbitrary code
segments, starting and completing I/O or data transmission,
and for many other events of interest. Output from a trace
tool may be analyzed in many ways to identify a variety of
performance metrics with respect to the execution environment.
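
A minimal Python sketch of event-based profiling is given here for illustration. The `traced` decorator, the `trace_buffer` list, and the sample functions are hypothetical names introduced only for this example; a real trace tool would typically write binary records to a buffer or file.

```python
import functools
import time

trace_buffer = []  # hypothetical in-memory trace buffer


def traced(func):
    """Wrap func so each call appends time-stamped entry and exit records."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        trace_buffer.append(("entry", func.__name__, time.perf_counter_ns()))
        try:
            return func(*args, **kwargs)
        finally:
            # The exit record is written even if func raises.
            trace_buffer.append(("exit", func.__name__, time.perf_counter_ns()))
    return wrapper


@traced
def inner():
    return sum(range(1000))


@traced
def outer():
    return inner() + 1


outer()
```

After the call, `trace_buffer` holds four records in the order entry(outer), entry(inner), exit(inner), exit(outer), each carrying a timestamp.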

However, instrumenting an application to enable trace
profiling may undesirably disturb the execution of the
application. As the application executes, the instrumentation code
may incur significant overhead, such as system calls to
obtain a current timestamp or other execution state
information. In fact, the CPU time consumed by the instrumentation
code alone may affect the resulting performance
metric of the application being studied.

Another problem in profiling an application is that
unwanted and/or unexpected effects may be caused by the
system environment to the state information that the
profiling processes are attempting to capture. Since most
computer systems are interruptible, multi-tasking systems, the
operating system may perform certain actions underneath
the profiling processes, unbeknownst to the profiling
processes. The most prevalent of these actions is a
thread-switch.

While a profiling process is attempting to capture state
information about a particular thread, i.e., a thread-relative
metric, the system may perform a thread switch. Once the
profiling process obtains the state information, the
information may have changed due to the thread switch. The
subsequent execution of the other thread may render the
desired, thread-relative, performance metric partially
skewed or completely unreliable. If the profiling process
does not perform any actions to determine the prevalence of
thread switches, the profiling process cannot determine the
reliability of certain performance metrics.

Significant efforts in the prior art have been employed to
minimize or negate the effects of thread switches on
performance metrics captured for profiling purposes. Typical
solutions to tracking performance information on a per-thread
basis require significant kernel support because of the need
to identify thread dispatch events initiated by the kernel and
then to associate the consumed performance metric with
these events so that an accurate accounting of resource
consumption on each thread can be obtained. Infrastructure
to support this kind of accounting can be quite expensive,
and in some cases, such as environments that lack an open,
extensible, kernel programming interface, the infrastructure
may be difficult or impossible to deploy.

Therefore, it would be advantageous to provide a method
and system for isolating the thread-relative performance
metrics from the effects caused by thread-switching, thereby
enhancing the accuracy of thread-relative metrics captured
by the profiling processes. It would be particularly
advantageous to provide a mechanism that employs the
hardware-based efficiencies provided by the performance
monitoring facilities of a processor.

SUMMARY OF THE INVENTION

A method, system, apparatus, or computer program product
is presented for low-overhead performance measurement
of an application executing in a data processing system in
order to generate per-thread performance information in a
multithreaded environment. While a first set of events is
being gathered or monitored for a particular thread as a first
metric, events that may indirectly cause inaccuracies in the
first metric are also monitored as a second metric, and the
presence of a positive value for the second metric is then
used to determine that the first metric is inaccurate or
unreliable. If the first metric is deemed inaccurate, then the
first metric may be discarded. When the first metric is
gathered without the occurrence of an event being counted
as the second metric, then the first metric is considered
accurate. The first metric may then be considered a
"thread-relative" metric as it has been observed during a time
period in which no events would have caused the first metric to
become inaccurate during the execution of a particular
thread. For example, the first metric could be a value of a
consumed resource, such as a number of executed
instructions, while the second metric is a number of
interrupts, each of which might cause the kernel to initiate
a thread switch. While the interrupt is serviced, the first
metric continues to monitor resource consumption, yet the
second metric indicates that the first metric is inaccurate
with respect to a particular thread.

Within an executing application, a first counter is read to
get a first performance metric beginning value, and a second
counter is read to get a second performance metric beginning
value. These values may be obtained near the entry of
a method with which an analyst desires to observe its
execution performance. The application then continues to
execute. Near the exit of the method, the first counter is read
to get a first performance metric ending value, and the
second counter is read to get a second performance metric
ending value. A first performance metric delta value is then
computed from the first performance metric beginning value
and the first performance metric ending value, and a second
performance metric delta value is computed from the second
performance metric beginning value and the second
performance metric ending value. A determination can then be
made as to whether the first performance metric delta value
is reliable as a thread-relative performance metric value
representing the first performance metric based upon
whether the second performance metric delta value is zero.
In one type of performance observation, the first counter
and the second counter are performance monitor counters in
a processor in the data processing system, and the
performance monitor counters are read directly by
application-level code.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention
are set forth in the appended claims. The invention itself,
however, as well as a preferred mode of use, further
objectives and advantages thereof, will best be understood by
reference to the following detailed description of an
illustrative embodiment when read in conjunction with the
accompanying drawings, wherein:

FIG. 1A depicts a typical distributed data processing
system in which the present invention may be implemented;

FIG. 1B depicts a typical computer architecture that may
be used within a client or server in which the present
invention may be implemented;

FIG. 2A is a block diagram depicting typical internal
functional units of a data processing system in which the
present invention may be implemented;

FIG. 2B is an illustration depicting an example
representation of one configuration of an MMCR suitable for
controlling the operation of two PMCs;

FIG. 3A is a diagram of software components within a
computer system that illustrates a logical relationship
between the components as layers of functions;

FIG. 3B illustrates a typical relationship of software
components within a computer system that supports a Java
environment in which the present invention may be
implemented;

FIG. 3C is a diagram depicting various processing phases
that are typically used to develop trace information;

FIG. 4 shows a timeline depicting the temporally relative
positions of events concerning a typical problem in
measuring a performance metric in a multithreaded environment;

FIG. 5 shows a timeline depicting the temporally relative
positions of events concerning the measurement of a
performance metric in a multithreaded environment in
accordance with a first embodiment of the present invention;

FIG. 6A is a flowchart depicting the steps that may be
necessary to configure a processor for counting events in
accordance with a first embodiment of the present invention;

FIG. 6B is a flowchart depicting the process of determining
the value of a thread-relative performance metric and its
validity using two performance monitor counters in
accordance with a first embodiment of the present invention; and

FIG. 7 is a flowchart depicting the process of determining
the value of a thread-relative performance metric and its
validity using a performance monitor counter and a
kernel-derived thread dispatch counter in accordance with a second
embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

With reference now to the figures, FIG. 1A depicts a 
typical distributed data processing system in which the 
present invention may be implemented. Distributed data 
processing system 100 contains network 101, which is the 
medium used to provide communications links between
various devices and computers connected together within 
distributed data processing system 100. Network 101 may 
include permanent connections, such as wire or fiber optic 
cables, or temporary connections made through telephone or 
wireless communications. In the depicted example, server
102 and server 103 are connected to network 101 along with 
storage unit 104. In addition, clients 105-107 also are 
connected to network 101. Clients 105-107 may be a variety 
of computing devices, such as personal computers, personal 
digital assistants (PDAs), etc. Distributed data processing
system 100 may include additional servers, clients, and other 
devices not shown. In the depicted example, distributed data 
processing system 100 may include the Internet with net- 
work 101 representing a worldwide collection of networks 
and gateways that use the TCP/IP suite of protocols to
communicate with one another. Of course, distributed data 
processing system 100 also may be configured to include a 
number of different types of networks, such as, for example, 
an intranet, a local area network (LAN), or a wide area 
network (WAN).

FIG. 1A is intended as an example of a heterogeneous 
computing environment and not as an architectural limita- 
tion for the present invention. The present invention could 
be implemented on a variety of hardware platforms, such as
server 102 or client 107 shown in FIG. 1A. Requests for the 
collection of thread-relative metric information may be 
initiated on a first device within the network, while a second 
device within the network receives the request, collects the 
thread-relative metric information with respect to the second
device, and returns the collected data to the first device. 

With reference now to FIG. 1B, a diagram depicts a
typical computer architecture that may be used within a
client or server, such as those shown in FIG. 1A, in which
the present invention may be implemented. Data processing
system 110 employs a variety of bus structures and proto- 
cols. Processor card 111 contains processor 112 and L2 
cache 113 that are connected to 6XX bus 115. System 110 
may contain a plurality of processor cards; processor card 
116 contains processor 117 and L2 cache 118.

6XX bus 115 supports system planar 120 that contains 
6XX bridge 121 and memory controller 122 that supports 
memory card 123. Memory card 123 contains local memory 
124 consisting of a plurality of dual in-line memory modules 
(DIMMs) 125 and 126.

6XX bridge 121 connects to PCI bridges 130 and 131 via 
system bus 132. PCI bridges 130 and 131 are contained on 
native I/O (NIO) planar 133 which supports a variety of I/O 
components and interfaces. PCI bridge 131 provides connections
for external data streams through network adapter
134 and a number of card slots 135-136 via PCI bus 137.
PCI bridge 130 connects a variety of I/O devices via PCI bus
138. Hard disk 139 may be connected to SCSI host adapter
140, which is connected to PCI bus 138. Graphics adapter
141 may also be connected to PCI bus 138 as depicted, either
directly or indirectly.

ISA bridge 142 connects to PCI bridge 130 via PCI bus 
138. ISA bridge 142 provides interconnection capabilities 
through NIO controller 153 via ISA bus 144, such as serial 
connections 145 and 146. Floppy drive connection 147 
provides removable storage. Keyboard connection 148 and 
mouse connection 149 allow data processing system 110 to 
accept input data from a user. 

Non-volatile RAM (NVRAM) 150 provides non-volatile 
memory for preserving certain types of data from system 
disruptions or system failures, such as power supply prob- 
lems. System firmware 151 is also connected to ISA bus 144 
and controls the initial BIOS. Service processor 154 is 
connected to ISA bus 144 and provides functionality for 
system diagnostics or system servicing. 

Those of ordinary skill in the art will appreciate that the
hardware in FIG. 1B may vary depending on the system
implementation. For example, the system may have one or
more processors, and other peripheral devices may be used
in addition to or in place of the hardware depicted in FIG.
1B. The depicted examples are not meant to imply architectural
limitations with respect to the present invention.

With reference now to FIG. 2A, a block diagram depicts 
selected internal functional units of a data processing system 
that may include the present invention. System 200 com- 
prises hierarchical memory 210 and processor 230. Hierar- 
chical memory 210 comprises Level 2 cache 202, random 
access memory (RAM) 204, and disk 206. Level 2 cache 202 
provides a fast access cache to data and instructions that may 
be stored in RAM 204 in a manner which is well-known in 
the art. RAM 204 provides main memory storage for data 
and instructions that may also provide a cache for data and 
instructions stored on nonvolatile disk 206. 

Data and instructions may be transferred to processor 230 
from hierarchical memory 210 on instruction transfer path 
220 and data transfer path 222. Instruction transfer path 220 
and data transfer path 222 may be implemented as a single 
bus or as separate buses between processor 230 and hierar- 
chical memory 210. Alternatively, a single bus may transfer 
data and instructions between processor 230 and hierarchical 
memory 210 while processor 230 provides separate instruc- 
tion and data transfer paths within processor 230, such as 
instruction bus 232 and data bus 234. 

Processor 230 also comprises instruction cache 231, data 
cache 235, performance monitor 240, and instruction pipe- 
line 233. Performance monitor 240 comprises performance 
monitor counter (PMC1) 241, performance monitor counter 
(PMC2) 242, performance monitor counter (PMC3) 243, 
performance monitor counter (PMC4) 244, and monitor 
mode control register (MMCR) 245. Alternatively, proces- 
sor 230 may have other counters and control registers not 
shown. 

Processor 230 includes a pipelined processor capable of 
executing multiple instructions in a single cycle, such as the 
PowerPC family of reduced instruction set computing 
(RISC) processors. During operation of system 200, instruc- 
tions and data are stored in hierarchical memory 210. 
Instructions to be executed are transferred to instruction 
pipeline 233 via instruction cache 231. Instruction cache 231 
contains instructions that have been cached for execution 
within pipeline 233. Some instructions transfer data to or 
from hierarchical memory 210 via data cache 235. Other 
instructions may operate on data loaded from memory or
may control the flow of instructions.

Performance monitor 240 comprises event detection and
control logic, including PMC1-PMC4 241-244 and MMCR
245. Performance monitor 240 is a software-accessible
mechanism intended to provide detailed information with
significant granularity concerning the utilization of
processor instruction execution and storage control. The
performance monitor may include an implementation-dependent
number of performance monitor counters (PMCs) used to
count processor/storage related events. These counters may
also be termed "global counters". The MMCRs establish the
function of the counters with each MMCR usually controlling
some number of counters. The PMCs and the MMCRs
are typically special purpose registers physically residing on
the processor. These registers are accessible for read or write
operations via special instructions for that purpose. The
write operation is preferably only allowed in a privileged or
supervisor state, while reading is allowed in a problem state
since reading the special purpose registers does not change
a register's content. In a different embodiment, these
registers may be accessible by other means such as addresses in
I/O space. One skilled in the art will appreciate that the size
and number of the counters and the control registers are
dependent upon design considerations, including the cost of
manufacture, the desired functionality of processor 230, and
the chip area available within processor 230.

Performance monitor 240 monitors the entire system and
accumulates counts of events that occur as the result of
processing instructions. The MMCRs are partitioned into bit
fields that allow for event/signal selection to be
recorded/counted. Selection of an allowable combination of events
causes the counters to operate concurrently.

With reference now to FIG. 2B, an illustration provides an
example representation of one configuration of an MMCR
suitable for controlling the operation of two PMCs. As
shown in the example, an MMCR is partitioned into a
number of bit fields whose settings select events to be
counted, enable performance monitor interrupts, and specify
the conditions under which counting is enabled.

For example, given that an event under study is "register
accesses", a PMC may represent the number of register
accesses for a set of instructions. The number of register
accesses for the particular set of instructions might be added
to the total event count in a PMC that counts all register
accesses by all instructions in a particular time period. A
user-readable counter, e.g., PMC1, may then provide a
software-accessible number of register accesses since PMC1
was first initialized. With the appropriate values, the
performance monitor is readily configured for use in identifying a
variety of system performance characteristics.

Bits 0-4 and 18 of the MMCR in FIG. 2B determine the
scenarios under which counting is enabled. By way of
example, bit 0 may be a freeze-counting bit such that when
the bit is set, the values in the PMCs are not changed by
hardware events, i.e. counting is frozen. Bits 1-4 may
indicate other specific conditions under which counting is
performed. Bits 5, 16, and 17 are utilized to control interrupt
signals. Bits 6-15 may be utilized to control time or
event-based transitions. Bits 19-25 may be used for event selection
for PMC1, i.e. selection of signals to be counted for PMC1.
Bits 26-31 may be used for event selection for PMC2, i.e.
selection of signals to be counted for PMC2. The function
and number of bits may be chosen as necessary for selection
of events as needed within a particular implementation.

In addition to being able to be implemented on a variety
of hardware platforms, the present invention may be
implemented in a variety of software environments. Although the
methodology of the present invention may be applied to any
general software environment, the following examples
illustrate the present invention applied to a Java environment.

With reference now to FIG. 3A, a prior art diagram shows
software components within a computer system illustrating
a logical relationship between the components as layers of
functions. The kernel (Ring 0) of the operating system
provides a core set of functions that acts as an interface to
the hardware. I/O functions and drivers can be viewed as
resident in Ring 1, while memory management and
memory-related functions are resident in Ring 2. User
applications and other programs (Ring 3) access the
functions in the other layers to perform general data processing.
Rings 0-2, as a whole, may be viewed as the operating
system of a particular device. Assuming that the operating
system is extensible, software drivers may be added to the
operating system to support various additional functions
required by user applications, such as device drivers for
support of new devices added to the system.

With reference now to FIG. 3B, a block diagram
illustrates a typical relationship of software components within
a computer system that supports a Java environment in
which the present invention may be implemented.
Java-based system 360 contains platform specific operating
system 362 that provides hardware and system support to
software executing on a specific hardware platform. The
operating system generally supports the simultaneous
execution of multiple applications, and the Java Virtual
Machine (JVM) 364 is one software application that may
execute in conjunction with the operating system. JVM 364
provides a Java runtime environment with the ability to
execute Java application or applet 366, which is a program,
servlet, or software component written in the Java
programming language. The computer system in which JVM 364
operates may be similar to data processing system 110
shown in FIG. 1B. However, JVM 364 may be implemented
in dedicated hardware on a so-called Java chip or Java
processor with an embedded JVM core.

At the center of a Java runtime environment is the JVM,
which supports all aspects of Java's environment, including
its architecture, security features, mobility across networks,
and platform independence. The JVM is a virtual computer,
i.e. a computer that is specified abstractly. The Java
specifications define certain features that every JVM must
implement, with some range of design choices that may
depend upon the platform on which the JVM is designed to
execute. For example, all JVMs must execute Java
bytecodes and may use a range of techniques to execute the
instructions represented by the bytecodes. A JVM may be
implemented completely in software or somewhat in
hardware. This flexibility allows different JVMs to be designed
for a variety of hardware platforms, such as mainframe
computers and PDAs.

With reference now to FIG. 3C, a diagram depicts various
processing phases that are typically used to develop trace
information. Trace records may be produced by the
execution of small pieces of code called "hooks". Hooks may be
inserted in various ways into the code executed by
processes, including statically (source code) and
dynamically (through modification of a loaded executable). After
hooks have already been inserted into the process or
processes of interest, i.e., after the code has been
"instrumented", then profiling may be performed.
Instrumented code is not typically present in production quality
code because instrumentation changes the size and
performance of the code to be analyzed; hence, instrumented code
is typically not delivered within a final version of an
application. After an application has been instrumented with
trace hooks, the application may be executed, and the
generated trace information is captured and analyzed, as
shown in FIG. 3C.

An initialization phase 390 is used to capture the state of
a machine at the time tracing is initiated. This trace initial- 
ization data may include trace records that identify all 
existing threads, all loaded classes, and all methods for the 
loaded classes. A record may be written to indicate when all 
of the startup information has been written.

Next, during the tracing phase 392, trace records are 
written to a trace buffer or trace file. Trace hooks may be 
used to record data upon the execution of the hook, which 
is a specialized piece of code at a specific location in a 
routine, module, or program. Trace hooks are typically
inserted for the purpose of debugging, performance analysis, 
or enhancing functionality. These trace hooks may generate 
trace data that is stored in a trace data buffer. The trace data 
buffer may be subsequently stored in a file for post-
processing. Trace records captured from hooks are written to
indicate thread switches, interrupts, loading and unloading 
of classes and jitted methods, etc. Subject to memory 
constraints, the generated trace output may be as long and as 
detailed as the analyst requires for the purpose of profiling 
a particular program.

In the post-processing phase 394, the data collected in the 
trace file is analyzed. Each trace record may be processed in 
a serial manner. Trace records may contain a variety of 
information. A trace record may identify, directly or
indirectly, the thread and method that was being executed at 
the time that the trace record was generated; this information 
may be generated originally as binary identifiers that are 
subsequently converted to text strings. After all of the trace 
records are processed, the information may be formatted for
output in the form of a report or may be used for other 
post-processing activities. 
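
The serial post-processing of trace records described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present invention: the record layout (TraceRecord), the two lookup tables, and every identifier and name in the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class TraceRecord(NamedTuple):
    """Hypothetical trace record that carries binary identifiers only."""
    thread_id: int
    method_id: int


def post_process(records: List[TraceRecord],
                 thread_names: Dict[int, str],
                 method_names: Dict[int, str]) -> Counter:
    """Process records serially and count records per (thread, method) name pair."""
    report = Counter()
    for rec in records:  # each record is handled in order, one at a time
        # Convert binary identifiers to text; unknown ids keep a numeric label.
        thread = thread_names.get(rec.thread_id, f"thread#{rec.thread_id}")
        method = method_names.get(rec.method_id, f"method#{rec.method_id}")
        report[(thread, method)] += 1
    return report


# Lookup tables of the kind the initialization phase could have written out
# (assumed sample data).
threads = {1: "main", 2: "worker"}
methods = {10: "Foo.run", 11: "Foo.parse"}
trace = [TraceRecord(1, 10), TraceRecord(2, 11), TraceRecord(1, 10)]

for (thread, method), hits in sorted(post_process(trace, threads, methods).items()):
    print(f"{thread:8s} {method:12s} {hits}")
```

The resulting counts could then be formatted into a report or handed to a further post-processing step.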

More information about tracing and trace processing may
be found in the following patent applications: "PROCESS
AND SYSTEM FOR MERGING TRACE DATA FOR
PRIMARILY INTERPRETED METHODS", U.S. application
Ser. No. 09/343,439, currently pending, filed on Jun. 30,
1999; and "METHOD AND SYSTEM FOR MERGING
EVENT-BASED DATA AND SAMPLED DATA INTO
POSTPROCESSED TRACE OUTPUT", U.S. application
Ser. No. 09/343,438, currently pending, filed on Jun. 30,
1999.

The present invention may be implemented on a variety of
hardware and software platforms, as described above. More
specifically, though, the present invention is directed to a
technique of allowing for the accurate collection of
thread-relative performance metrics. The technique includes some
loss of information, i.e. it is heuristic in nature. However, in
controlled environments, such as benchmarks, or in heavily
repetitious environments, such as server application
environments, the technique yields useful results.

As noted previously, obtaining accurate thread-relative
metrics is generally difficult. Since most computer systems
are interruptible, multi-tasking systems, the operating
system may perform a thread-switch underneath the profiling
processes, unbeknownst to the profiling processes. While a
profiling process is attempting to capture state information
about a particular thread, i.e., a thread-relative metric, the
system may perform a thread switch. By the time that the
profiling process actually obtains the state information, the
desired information may have changed due to the thread
switch. The subsequent execution of a different thread may
render the desired, per-thread, performance metric partially
skewed or completely unreliable. If the profiling process
does not perform any actions to determine the prevalence of
thread switches, the profiling process cannot determine the
reliability of certain performance metrics.

As an illustration, consider the problem of tracking the
number of instructions executed between the start and the
end of a method in a Java application. This type of
information may be useful for comparing the number of executed
instructions in different versions of a method to determine
which version is more efficient in CPU time or other
computational resources.

With reference now to FIG. 4, a timeline depicts the
temporally relative positions of events concerning a typical
problem in measuring a performance metric in a
multithreaded environment. In this explanation, the example
specifically addresses the problem of measuring the number
of instructions executed in a Java method. However, any
specific section of code could be bracketed by a set of
operations that attempt to discern a performance metric.

At time point 402, the kernel dispatches thread A, which 
may comprise several methods. At the start of one of those 
methods, at time point 404, a hardware counter is read and 
its value stored in another processor register or a memory 
location. The hardware counter may be a performance 
monitor counter that has been previously configured to count 
the number of instruction completion events within the 
processor that is executing thread A. The value of the first 
reading of the performance monitor counter acts as a base 
value for a subsequent comparison. 

Unbeknownst to thread A, the kernel then performs a
thread switch at time point 406, suspending thread A and
activating thread B. The processor continues to
execute instructions and to count the instruction completion
events; the fact that the instructions represent separate
execution threads is not relevant to the operation of the
performance monitor counters.

At time point 408, the kernel performs another thread 
switch during which thread A is activated and thread B is 
suspended. At the end of the same method, at time point 410, 
the same hardware counter that represents completed 
instruction events is again interrogated, and at time point 
412, the two instruction counts are compared to compute an 
instruction count difference or delta. The method may then 
report the delta as the number of instructions executed 
within itself. 

As can be seen, since execution control was lost by thread 
A in the midst of the method, its accounting is in error and 
includes instruction completion events that occurred while 
thread B had execution control. The present invention 
attempts to mitigate such effects so that accurate, thread- 
relative metrics can be obtained. 
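
The accounting error of FIG. 4 can be reproduced with a small simulation. This is a sketch under stated assumptions: the class SimulatedProcessor and the instruction counts (100, 30, 5000, 25) are hypothetical values chosen for illustration and do not appear in the figure; only the time points 402 through 412 come from the description above.

```python
class SimulatedProcessor:
    """Toy model: one instruction counter shared by every thread on the CPU."""

    def __init__(self) -> None:
        self.instructions_completed = 0

    def execute(self, thread: str, count: int) -> None:
        # The counter advances no matter which thread is running.
        self.instructions_completed += count


cpu = SimulatedProcessor()

cpu.execute("A", 100)               # 402: thread A dispatched, runs before the method
base = cpu.instructions_completed   # 404: method entry, first counter read
cpu.execute("A", 30)                # first part of the method
cpu.execute("B", 5000)              # 406: thread switch, thread B runs
cpu.execute("A", 25)                # 408: thread A resumes and finishes the method
end = cpu.instructions_completed    # 410: method exit, second counter read
delta = end - base                  # 412: instruction count reported by the method

true_cost = 30 + 25                 # instructions the method itself executed
print(delta, true_cost)             # prints "5055 55"
```

In this assumed run the reported delta exceeds the true cost by exactly the 5000 instructions that thread B executed while thread A was suspended.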

Significant efforts in the prior art have been employed to 
minimize or negate the effects of thread switches on thread- 
relative performance metrics captured for profiling pur- 
poses. Typical solutions to tracking performance informa- 
tion on a per-thread basis generally require significant 
amounts of kernel support because of the need to identify 
thread dispatch events initiated by the kernel and then to 
associate the consumed performance metric with these 
events so that an accurate accounting of resource consump- 
tion on each thread can be obtained. For example, some of 
these prior art solutions actually track performance monitor 
counter values within the kernel at interrupt times and at 
thread dispatch, thereby incurring a long path length in the 
interrupt and dispatch execution paths. Kernel data
structures that correspond with user space (and kernel space)
threads are used to accumulate the needed performance
monitor counter values. These values can be interrogated 
from the ring 3 or application level on method entry and exit; 
however, a system call is required to obtain access to these 
kernel data structures. The data structures cannot simply be 
mapped into the application level because they are complex 
and require serialization (or locking) to access them safely. 
Infrastructure to support this kind of accounting can be quite 
expensive, and in some cases, such as environments that 
lack an open, extensible, kernel programming interface, the 
infrastructure may be difficult or impossible to deploy. 

The technique employed by a first embodiment of the 
present invention relies on the following observations and 
assumptions. In most instances, for most function or method 
calls, a function or method completes without interruption. 
However, an active function or method may be forced to 
relinquish control of a CPU for several reasons: (1) a timer 
interrupt, originally requested by the operating system to 
apportion time slices to threads, is received by the operating 
system, thereby causing the operating system to suspend the 
currently active thread (and its methods); (2) a hardware 
interrupt is received and requires appropriate servicing; and 
(3) the currently executing thread is blocked while waiting 
for the completion of an input/output action or the relin- 
quishment of some type of lock. 

The first embodiment of the present invention recognizes 
that meaningful, thread-relative metrics at the application 
level may be obtained directly from the hardware in most 
cases, even though thread-switches may occur at any time as 
determined by the kernel. Rather than attempting to track 
thread switches by somehow obtaining kernel-level infor- 
mation concerning thread switches, the first embodiment of 
the present invention instead performs the following novel 
actions. While a first set of events is being gathered or 
monitored for a particular thread as a first metric, events that 
may indirectly cause a thread switch are also monitored as 
a second metric, and the presence of a positive value for the
second metric is then used to determine that the first metric 
is inaccurate or unreliable. If the first metric is deemed 
unreliable, then the first metric is discarded, and the fact that 
the sample was discarded is then recorded in order to 
provide an indication of the reliability of the thread-relative 
measurements. 

In this manner, the kernel-level information concerning
thread-switches is by-passed, and hence, the difficulties in
obtaining kernel-level information about thread-switches are
eliminated. When the first metric is gathered without the
occurrence of an event being counted as the second metric,
then the first metric is considered accurate. The first metric
may then be considered a "thread-relative" metric as it has
been gathered during a time period in which no events would
have caused the kernel to initiate a thread switch. If the
second metric is positive, then the first metric is discarded,
and that fact may be recorded.
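
The discard-and-record behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class ThreadRelativeSampler, its method names, and the dictionary that stands in for the hardware counters are all hypothetical, and real performance monitor counters would be read with platform-specific instructions rather than from a Python dictionary.

```python
from typing import Callable, List, Tuple


class ThreadRelativeSampler:
    """Keeps deltas of a first metric only when a second metric did not move."""

    def __init__(self, read_first: Callable[[], int],
                 read_second: Callable[[], int]) -> None:
        self.read_first = read_first      # e.g. instructions completed
        self.read_second = read_second    # e.g. interrupts taken
        self.accepted: List[int] = []
        self.discarded = 0

    def measure(self, region: Callable[[], None]) -> Tuple[int, bool]:
        first_base = self.read_first()
        second_base = self.read_second()
        region()
        first_delta = self.read_first() - first_base
        second_delta = self.read_second() - second_base
        reliable = second_delta == 0
        if reliable:
            self.accepted.append(first_delta)
        else:
            self.discarded += 1           # record that a sample was thrown away
        return first_delta, reliable

    def discard_ratio(self) -> float:
        total = len(self.accepted) + self.discarded
        return self.discarded / total if total else 0.0


# Stand-in counters and regions (assumed values for illustration).
counters = {"instructions": 0, "interrupts": 0}


def quiet_region() -> None:
    counters["instructions"] += 55


def interrupted_region() -> None:
    counters["instructions"] += 55 + 5000   # another thread's work leaks in
    counters["interrupts"] += 2


sampler = ThreadRelativeSampler(lambda: counters["instructions"],
                                lambda: counters["interrupts"])
for region in (quiet_region, quiet_region, interrupted_region, quiet_region):
    sampler.measure(region)

print(sampler.accepted, sampler.discarded, sampler.discard_ratio())
# prints "[55, 55, 55] 1 0.25"
```

The discard ratio gives a simple indication of how often the thread-relative measurement had to be rejected, which is one way of recording the reliability of the measurements.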

In an alternative scenario, rather than being concerned 
only about events that cause a thread switch, one may also 
monitor asynchronous events, e.g., timer interrupts, that may 
not cause a thread switch yet obtain execution control and 
cause resources to be consumed, thereby affecting the 
counted events for the thread of interest. These types of 
asynchronous events may also be monitored as a second 
metric, and the presence of a positive value for the second 
metric is then used to determine that the first metric is 
inaccurate or unreliable. 

Hence, the first embodiment of the present invention can
perform thread-relative performance accounting at the
application level with low overhead. A particular metric is
computed from values obtained directly from the hardware,
and the metric is determined to be a thread-relative metric
based on other metric values obtained directly from the
hardware.

Returning to the example of tracking the number of 
instructions executed between the start and the end of a 
method in a Java application, the advantage of the first 
embodiment of the present invention can be observed. 

With reference now to FIG. 5, a timeline depicts the
temporally relative positions of events concerning the
measurement of a performance metric in a multithreaded
environment in accordance with a first embodiment of the
present invention. In this explanation, the example
specifically addresses the problem of measuring the number of
instructions executed in a Java method. At time point 502,
the kernel dispatches thread Y, which may comprise several
methods. At the start of one of those methods, at time point
504, a first hardware counter is read and its value stored in
another processor register or a memory location. The first
hardware counter may be a first performance monitor
counter that has been previously configured to count the
number of instruction completion events within the
processor that is executing thread Y. The value of the first reading
of the first performance monitor counter acts as a base value
for a subsequent comparison.

Immediately after reading the first performance monitor 
counter, at time point 505, a second hardware counter can be 
read and its value stored in a processor register or a memory 
location. The second hardware counter may be a second 
performance monitor counter that has been previously con- 
figured to count the number of hardware interrupt events 
within the processor. The value of the first reading of the 
second performance monitor counter acts as a base value for 
a subsequent comparison. 

Unbeknownst to thread Y, the kernel then performs a
thread switch at time point 506, suspending thread Y and
activating thread Z. The processor continues to
execute instructions and to count the instruction completion
events and interrupts, if necessary; the fact that the
instructions represent separate execution threads is not relevant to
the operation of the performance monitor counters.

At time point 508, the kernel performs another thread
switch during which thread Y is activated and thread Z is
suspended. At the end of the same method, at time point 510,
the first hardware counter is again interrogated, and at the
end of the same method, at time point 511, the second
hardware counter is again interrogated. At time point 512,
the two instruction counts are compared to compute an
instruction count difference or delta. At time point 513, the
two interrupt counts are compared to compute an interrupt
count difference or delta.

In this example, since the kernel performs at least two
thread switches from thread Y to thread Z back to thread Y,
at least two interrupts have occurred to initiate the thread
switches. The number of interrupts being counted by the
second performance monitor counter is greater at time point
511 than at time point 505, when its base value was read, and
the computed instruction count delta is deemed inaccurate at
time point 514.

In general, if any hardware interrupts have occurred
during the time period being studied, then the value of
the first counter is deemed inaccurate. If no interrupts have
occurred, then the method may report the delta as the
number of instructions executed within itself.
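
The sequence of time points in FIG. 5 can be replayed in a short simulation. This is a sketch under stated assumptions: the class SimulatedProcessor, the function run_method, and the instruction counts (30, 5000, 25) are hypothetical; only the time points 504 through 514 are taken from the description above.

```python
from typing import Tuple


class SimulatedProcessor:
    """Toy model with two counters that ignore which thread is running."""

    def __init__(self) -> None:
        self.instructions = 0   # first performance monitor counter
        self.interrupts = 0     # second performance monitor counter

    def execute(self, count: int) -> None:
        self.instructions += count

    def interrupt(self) -> None:
        self.interrupts += 1


def run_method(cpu: SimulatedProcessor, switched: bool) -> Tuple[int, bool]:
    """Bracket one method and return (instruction delta, deemed reliable)."""
    instr_base = cpu.instructions      # 504: first counter, base value
    intr_base = cpu.interrupts         # 505: second counter, base value
    cpu.execute(30)                    # thread Y, first part of the method
    if switched:
        cpu.interrupt()                # 506: interrupt leads to a switch to thread Z
        cpu.execute(5000)              # thread Z runs
        cpu.interrupt()                # 508: interrupt leads to a switch back to Y
    cpu.execute(25)                    # thread Y finishes the method
    instr_delta = cpu.instructions - instr_base   # 510 and 512
    intr_delta = cpu.interrupts - intr_base       # 511 and 513
    return instr_delta, intr_delta == 0           # 514: reliability decision


cpu = SimulatedProcessor()
print(run_method(cpu, switched=True))    # prints "(5055, False)"
print(run_method(cpu, switched=False))   # prints "(55, True)"
```

In the interrupted run the non-zero interrupt delta marks the instruction delta as unreliable; in the uninterrupted run the delta of 55 may be reported as the method's own instruction count.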

With reference now to FIG. 6A, a flowchart depicts the
steps that may be necessary to configure a processor for
counting events in accordance with a first embodiment of the
present invention. A first performance monitor counter is
configured to measure the consumption of a thread-relative
resource, such as counting the number of executed
instructions (step 602). A second performance monitor counter is
configured to measure occurrence of a hardware-level event
that is relevant to thread switches, such as interrupts (step
604). For example, a monitor mode control register in the
processor can be configured to control the performance
monitor counters by executing one or more specialized
instructions for this purpose.

If necessary, an optional step may need to be performed
to permit application-level read operations of the
performance monitor counters (step 606). For example,
depending upon the processor, the specialized instructions for
controlling the performance monitor registers might initially
be available only to privileged processes, such as
kernel-level processes. Before application-level operations can be
performed on the performance monitor, the kernel may need
to perform some operation, such as executing one or more
special instructions, to enable the application-level
operations. The steps shown in FIG. 6A merely represent that the
processor may have to be configured in some manner before
application-level operations may be performed, and the
specific steps may be dependent upon the hardware platform
on which the present invention is implemented.

With reference now to FIG. 6B, a flowchart depicts the
process of determining the value of a thread-relative
performance metric and its validity using two performance
monitor counters in accordance with a first embodiment of
the present invention. The process shown in FIG. 6B may
follow the process shown in FIG. 6A immediately or at a
much later point in time.

The process begins by reading a first performance monitor
counter to get a base number of events representing a
consumed resource (step 610). A difference operation is
performed during the period under observation because most
hardware environments do not allow application-level
writes to the performance monitor registers. Therefore, the
application cannot initialize the performance monitor
counters to zero. The process continues by reading a second
performance monitor counter to get a base number of events
that may cause thread switches (step 612). A region of code
is then executed while various hardware operations are
observed by the performance monitor (step 614).

The process continues by reading the first performance
monitor counter to get the ending number of events
representing a consumed resource (step 616), and the second
performance monitor counter is read to get an ending
number of events that may cause thread switches (step 618).
The value of the consumed resource is computed from the
values that were read from the first performance monitor
counter (step 620). A number of events that could have
caused potential thread switches is computed from the
values that were read from the second performance monitor
counter (step 622). The reliability of the value of the
consumed resource in representing a thread-relative
performance metric is then determined from the existence of a
potential thread switch, i.e. an increase in the value of the
second performance monitor counter (step 624).

The advantages of the first embodiment of the present
invention should be apparent in view of the detailed
description of the invention that is provided above. The entire
process is performed relatively simply, and the reading of
the performance monitor counters requires very low
overhead. Another advantage is that the process does not require
special kernel instrumentation.

However, one disadvantage of the first embodiment of the
present invention is that the process may discard the
thread-relative performance metric too often. Although the kernel
may perform a thread switch immediately after an interrupt
in response to the interrupt, it does not necessarily always
perform a thread switch after each interrupt. Hence, an
interrupt may occur that does not cause a thread switch, but
with the methodology of reading an interrupt performance
monitor counter, each interrupt will be considered to have
caused a thread switch. In this manner, even though most
sections of code may complete without an intervening
interrupt, the methodology may "overcount" the number of
thread switches by considering each interrupt as having a
one-to-one correspondence with a thread switch. For example,
certain system calls may generate a "soft" interrupt that is
directly attributable to the immediately preceding system
call, and the interrupt should not disrupt the performance
monitoring process and cause the performance metric to be
discarded. In other words, any resource consumed while
servicing such interrupts for system calls should be
attributed to the calling thread.

Another disadvantage is that, in certain cases, some
resource consumption may be counted that ideally would not
be counted. For example, suppose that there is no interrupt
and no thread switch during the execution of a method that
is being timed or measured but the thread executes some
type of sleep function that places the thread in sleep mode
for a particular time period. The thread has yielded control
of the processor for a small period of time, yet the thread was
not actually executing instructions during that time. Other
examples of undercounting and overcounting could also
illustrate similar problems.

A second embodiment of the present invention uses a
methodology similar to the first embodiment of the present
invention, but an attempt is made to obtain a more accurate
result. By extending existing kernel functionality to count
thread dispatches and/or asynchronous interrupts or other
desired events, and then making the count available to an
application, a more accurate but still very efficient
methodology is presented for characterizing thread-relative
resource consumption at the application level. Hence, the
"overcounting" or "undercounting" of the first embodiment
is remedied by having an accurate count of asynchronous
interrupts, or thread switches, or other events from the
kernel level.

An additional benefit of this approach is that some of the
consumption of resources that otherwise would not have
been captured by the methodology used in the first
embodiment of the present invention are then properly captured in
the thread-relative metric. For example, the monitored
thread may cause a particular interrupt, and the consumption
of resources during the handling of that particular interrupt
should be attributed to the thread and captured in the
thread-relative performance metric. By not discarding the
performance monitor counter after certain interrupts, as
might occur with the methodology of the first embodiment
of the present invention, the accurate retrieval of a thread
switch count from the kernel level provides a more accurate
performance metric. Hence, the consumption of resources is
captured in the thread-relative performance metric during
specialized processing that was caused by the thread being
monitored for the thread-relative performance metric.

With reference now to FIG. 7, a flowchart depicts the
process of determining the value of a thread-relative
performance metric and its validity using a performance
monitor counter and a kernel-derived thread dispatch counter in
accordance with a second embodiment of the present
invention.

A thread dispatch counter is maintained at the kernel
level. As the kernel dispatches a thread, either by spawning
a new thread or activating a sleeping thread, the kernel
increments the thread dispatch counter. The kernel provides
some manner for allowing applications to read the thread
dispatch counter, i.e. the thread dispatch counter is
effectively mapped to the application level.

As an example of how the application may have quick,
efficient access to the thread dispatch counter, the thread
dispatch counter may be a global counter that is maintained
by the kernel in a kernel-mapped memory area. The kernel
has previously mapped an area of memory for user-space
access. Ring 3 access to the global counter is provided
through a simple read of a variable and not through a system
call.

Again, although a thread dispatch counter is used as the
second performance metric in the example, the second
performance metric stored in memory may represent another
consumed resource or another observed event that
disqualifies or invalidates the first performance metric obtained from
the performance monitor.

In a manner similar to that shown in FIG. 6A, a
configuration process may be required to initialize the performance
monitor counter and the kernel. For example, the application
to be monitored may have been instrumented, and its
instrumentation code may call a function within the interface to
the kernel that allows the kernel to perform its own
initialization, which might also include configuring the
performance monitor.

After an optional configuration, the process begins by
reading a performance monitor counter to get a base number
of events representing a consumed resource (step 702). A
thread dispatch counter is read to get a base number for the
thread dispatches (step 704). A region of code is then
executed while various hardware operations are observed by
the performance monitor (step 706).

The process continues by reading the performance
monitor counter to get the ending number of events representing
a consumed resource (step 708), and the thread dispatch
counter is read to get its ending number (step 710). The
value of the consumed resource is computed from the values
that were read from the performance monitor counter (step
712). A number of thread dispatches is computed from the
values that were read from the thread dispatch
counter (step 714). The reliability of the value of the
consumed resource in representing a thread-relative
performance metric is then determined from the existence of a
thread dispatch, i.e. a non-zero difference between the base
number of thread dispatches and the ending value of the
thread dispatch counter (step 716). If a thread dispatch has
occurred during the execution of the section of code that is
being observed, then the observed value for the consumed
resource is deemed unreliable as it possibly also contains the
consumption of the resource by at least one other thread.

The advantages of the second embodiment of the present
invention should be apparent in view of the detailed
description of the invention that is provided above. Similar to the
first embodiment, the entire process is performed relatively
simply, and the reading of the performance monitor counters
requires very low overhead. Additionally, only a single
performance monitor counter is used, and because there are
only a limited number of these counters in a processor, the
light use of performance monitor counters may be important
to the execution environment. Although the process in the
second embodiment does require a small amount of special
kernel instrumentation, it is also performed relatively simply
with relatively low overhead.

A third embodiment of the present invention is similar to
the second embodiment of the present invention. However,
rather than having the application read a performance
monitor counter to track the consumption of a resource, the kernel
may track the consumption of a resource and also provide
this value at the ring 3 or application level in a
kernel-mapped memory area that is user-space accessible. Hence,
the application can simply read kernel-maintained global
counters to obtain both the thread dispatch counter (or
asynchronous interrupt or other event or resource of interest)
and the counter value representing the consumption of a
particular resource. In any case, the value of the first
performance metric is employed to determine the reliability
of the value representing the consumption of a resource or
the second performance metric. If the value is deemed
accurate, then it represents a thread-relative performance
metric for that particular thread. If it is deemed inaccurate,
then that fact may be recorded and used as part of a
post-processing analysis.

It should be noted that the value of the performance metric
of interest could be a kernel variable, which the kernel
would map to a kernel-maintained memory area for
user-space access. Alternatively, a global counter may contain a
variable that is already somewhat accessible to the
user-space, e.g., a global counter maintained in the JVM's
address space, such as a count of bytes allocated from the
Java heap. In any case, system calls to the kernel are
eliminated and replaced by simple memory reads.

For each of the three embodiments above, at some point
after the collection of a performance metric, probably during
post-processing, the thread-relative performance metric may
be reported to a user in some manner. The report can be
enhanced by displaying more than just the mean performance
metric or resource. One can also display [illegible], which
provides a good indication of whether there is much
variability in the reported metric.

For example, the classic use of a global counter is the
situation in which the global counter is read at the entry and
exit of a routine in order to attribute resource consumption
by the currently executing thread to that specific routine.
This can be done in a variety of ways. In one example, it can
be accumulated into a counter associated with that thread
and routine (e.g., using a two dimensional array, COUNT
[routine_id, thread_id]=COUNT[routine_id, thread_id]+
delta_value). In another example, the metric could be
recorded in a trace buffer maintained in user-space. In a third
example, the metric could be recorded in a data structure that
tracks calling sequences and resources consumed along
different calling paths.

Instead of simply accumulating the delta value at the exit
of the instrumented method, if one records key aspects of the
distribution of occurrences, e.g., the values themselves, the
moment of the values, and the minimum and maximum
values, then this data can be examined to help
qualify the validity of the data recorded.

For example, suppose after some number of exits of
method A, a history of elapsed instructions were executed in
the manner shown in Table 1.

TABLE 1

                     Delta number of    Delta number of
Invocation number    instructions       dispatches/interrupts

1                    55                 0
2                    55                 0
3                    55                 0
4                    10,601             0
5                    55                 0
6                    55                 0


There is some evidence in Table 1 that invocation number 4 
was anomalous. It may have occurred because of an unde- 
tected thread switch or interrupt occurring. In this case, one 
can (with some human intervention and interpretation) posit 
the removal of that invocation from the sample. 
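
The kind of screening described for Table 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name suspicious_invocations and the threshold (a delta more than ten times the median) are hypothetical choices made for this example, and the result is only a candidate list for the analyst to interpret.

```python
from statistics import median
from typing import List, Tuple

# (invocation number, delta instructions, delta dispatches/interrupts), from Table 1.
TABLE_1: List[Tuple[int, int, int]] = [
    (1, 55, 0), (2, 55, 0), (3, 55, 0), (4, 10601, 0), (5, 55, 0), (6, 55, 0),
]


def suspicious_invocations(history: List[Tuple[int, int, int]],
                           factor: float = 10.0) -> List[int]:
    """Return invocation numbers whose delta is far above the median delta.

    `factor` is an arbitrary threshold chosen for this sketch.
    """
    typical = median(delta for _, delta, _ in history)
    return [n for n, delta, _ in history if delta > factor * typical]


flagged = suspicious_invocations(TABLE_1)
kept = [delta for n, delta, _ in TABLE_1 if n not in flagged]
print(flagged, kept)   # prints "[4] [55, 55, 55, 55, 55]"
```

The median is used here rather than the mean because a single large value such as 10,601 would otherwise dominate the reference value it is compared against.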

Alternatively, one might observe the execution history of 
a number of exits of method A as shown in Table 2. 



TABLE 2

                     Delta number of    Delta number of
Invocation number    instructions       dispatches/interrupts

1                    255                [illegible]
2                    255                [illegible]
3                    255                [illegible]
4                    255                [illegible]
5                    255                [illegible]
6                    255                [illegible]


There is evidence in Table 2 that, while there were some 
number of interrupts and dispatches, the instruction path 
length is nonetheless constant. An experienced analyst might 
use this information to conclude that the interrupts/ 
dispatches in this case were actually synchronous, so 
resources consumed during these invocations should be 
attributed to the thread. 

In both cases shown in Table 1 and Table 2, one uses the 
information about the distribution of the metric of interest to 
further qualify the validity of the data collected. This infor- 
mation is being used in conjunction with the interrupt and 
dispatch counts to provide further help with the interpreta- 
tion of the data. In all cases, this requires a mode of analysis 
that is more post-processing in nature.
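
One way to keep the distribution information mentioned above is a running summary per routine and thread. This is a sketch under stated assumptions: the class Distribution, the function record_exit, the use of the sum of squares as the recorded moment, and the thread identifier 1 are hypothetical; only the instruction deltas of 255 are taken from Table 2.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Distribution:
    """Running summary of the deltas seen for one (routine, thread) pair."""
    count: int = 0
    total: int = 0
    sum_of_squares: int = 0
    minimum: int = 0
    maximum: int = 0

    def add(self, delta: int) -> None:
        if self.count == 0:
            self.minimum = self.maximum = delta
        else:
            self.minimum = min(self.minimum, delta)
            self.maximum = max(self.maximum, delta)
        self.count += 1
        self.total += delta
        self.sum_of_squares += delta * delta

    @property
    def mean(self) -> float:
        # Only meaningful after at least one delta has been added.
        return self.total / self.count

    @property
    def variance(self) -> float:
        # Population variance derived from the first and second moments.
        return self.sum_of_squares / self.count - self.mean ** 2


# Keyed by (routine_id, thread_id), in the spirit of COUNT[routine_id, thread_id].
stats: Dict[Tuple[str, int], Distribution] = defaultdict(Distribution)


def record_exit(routine_id: str, thread_id: int, delta: int) -> None:
    """Called at the exit of an instrumented routine with the measured delta."""
    stats[(routine_id, thread_id)].add(delta)


for delta in (255, 255, 255, 255, 255, 255):   # instruction deltas from Table 2
    record_exit("A", 1, delta)                 # thread id 1 is an assumed value

d = stats[("A", 1)]
print(d.count, d.minimum, d.maximum, d.mean, d.variance)
# prints "6 255 255 255.0 0.0"
```

A variance of zero together with equal minimum and maximum values is the kind of evidence an analyst could use to conclude that the path length was constant across the recorded invocations.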

It is important to note that while the present invention has 
been described in the context of a fully functioning data 
processing system, those of ordinary skill in the art will 
appreciate that the processes of the present invention are 
capable of being distributed in the form of a computer 
readable medium of instructions and a variety of other forms 
and that the present invention applies equally regardless of 
the particular type of signal bearing media actually used to 
carry out the distribution. Examples of computer readable 
media include media such as EPROM, ROM, tape, paper, 
floppy disc, hard disk drive, RAM, and CD-ROMs and 
transmission-type media such as digital and analog commu- 
nications links. 

The description of the present invention has been pre- 
sented for purposes of illustration and description but is not 
intended to be exhaustive or limited to the invention in the 
form disclosed. Many modifications and variations will be 
apparent to those of ordinary skill in the art. The embodi- 
ment was chosen and described in order to best explain the 
principles of the invention, the practical application, and to 
enable others of ordinary skill in the art to understand the 
invention for various embodiments with various
modifications as are suited to the particular use contemplated.

What is claimed is:

1. A method for monitoring the performance of an appli- 
cation executing in a data processing system, the method 
comprising: 

reading a first counter to get a first performance metric 
beginning value for a first performance metric; 

reading a second counter to get a second performance 
metric beginning value for a second performance met- 
ric; 

executing at least a portion of the application;

reading the first counter to get a first performance metric
ending value;

reading the second counter to get a second performance 
metric ending value; 

computing a first performance metric delta value from the 
first performance metric beginning value and the first 
performance metric ending value; 

computing a second performance metric delta value from 
the second performance metric beginning value and the 
second performance metric ending value; and 

determining whether the first performance metric delta 
value is reliable as a thread-relative performance metric 
value representing the first performance metric during 
the execution of the portion of the application based 
upon whether the second performance metric delta 
value is zero. 
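
As an illustration only, the steps recited in claim 1 can be sketched in Python as follows. The names `Measurement` and `measure` and the parameters `read_first`, `read_second`, and `portion` are hypothetical and do not appear in the specification. The counters are modeled as plain callables rather than hardware performance monitor counters, and the second counter is assumed to count events, such as interrupts or thread dispatches, that would disturb the first counter.

```python
from typing import Callable, NamedTuple


class Measurement(NamedTuple):
    first_delta: int    # change in the metric of interest
    second_delta: int   # change in the disturbance metric
    reliable: bool      # True only when second_delta is zero


def measure(read_first: Callable[[], int],
            read_second: Callable[[], int],
            portion: Callable[[], None]) -> Measurement:
    """Bracket `portion` with counter reads and classify the result."""
    # Beginning values, read in the order recited in the claim.
    first_begin = read_first()
    second_begin = read_second()

    portion()  # execute at least a portion of the application

    # Ending values.
    first_end = read_first()
    second_end = read_second()

    first_delta = first_end - first_begin
    second_delta = second_end - second_begin

    # The first delta is treated as thread-relative only if no
    # disturbing event was counted during the measured interval.
    return Measurement(first_delta, second_delta, second_delta == 0)
```

In this sketch a nonzero second delta does not correct the first delta; it only marks it as unreliable, leaving the caller to decide what to do with the value.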

2. The method of claim 1 wherein the first counter and the 
second counter are performance monitor counters in a 
processor in the data processing system, and wherein the 
first counter and the second counter are read directly by 
application-level code. 

3. The method of claim 1 wherein the first counter is a 
performance monitor counter in a processor in the data 
processing system and the second counter is a kernel- 
maintained counter, and wherein the first counter and the 
second counter are read directly by application-level code. 

4. The method of claim 3 wherein the second counter is 
in a kernel-mapped area of memory. 

5. The method of claim 1 wherein the first counter and the 
second counter are kernel-maintained counters, and wherein 
the first counter and the second counter are read directly by 
application-level code from a kernel-mapped area of 
memory. 
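
Claims 2 through 5 differ only in where the counters reside. A minimal sketch of the kernel-mapped case of claims 4 and 5 is given below, under loudly stated assumptions: the mapped area is simulated with a `bytearray`, and the layout (two unsigned 64-bit little-endian counters at offsets 0 and 8) is hypothetical, as are the names `read_counter`, `INTERRUPT_COUNT_OFFSET`, and `DISPATCH_COUNT_OFFSET`. An actual layout would be defined by the operating system.

```python
import struct

# Hypothetical layout of the mapped area (an assumption for illustration).
INTERRUPT_COUNT_OFFSET = 0
DISPATCH_COUNT_OFFSET = 8
_U64 = struct.Struct("<Q")  # one unsigned 64-bit little-endian value


def read_counter(mapped_area, offset: int) -> int:
    """Read one counter directly from the buffer, with no system call."""
    return _U64.unpack_from(mapped_area, offset)[0]


# Stand-in for the kernel-mapped area. In a real system the kernel,
# not the application, would update these values.
mapped_area = bytearray(16)
_U64.pack_into(mapped_area, DISPATCH_COUNT_OFFSET, 7)

dispatches = read_counter(mapped_area, DISPATCH_COUNT_OFFSET)
interrupts = read_counter(mapped_area, INTERRUPT_COUNT_OFFSET)
```

The point of the sketch is that application-level code reads the counter as an ordinary memory access, which is what keeps the measurement overhead low.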

6. The method of claim 1 wherein the second performance 
metric represents a number of interrupts. 

7. The method of claim 3 wherein the second performance 
metric represents a number of thread dispatch events. 

8. The method of claim 1 wherein the first performance 
metric beginning value and the second performance metric 
beginning value are read substantially near an invocation of 
a routine and wherein the first performance metric ending 
value and the second performance metric ending value are 
read substantially near an end of the routine. 
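
Reading the counters near the invocation and near the end of a routine, as in claim 8, might be arranged with an entry/exit wrapper. The following is a sketch only; the decorator name `profiled` and the `sink` list are hypothetical, and the counters are again modeled as callables.

```python
import functools
from typing import Callable, List, Tuple

Record = Tuple[str, int, int]  # (routine name, first delta, second delta)


def profiled(read_first: Callable[[], int],
             read_second: Callable[[], int],
             sink: List[Record]):
    """Wrap a routine so both counters are read at entry and at exit."""
    def decorate(routine):
        @functools.wraps(routine)
        def wrapper(*args, **kwargs):
            # Beginning values, read just before the routine body runs.
            first_begin = read_first()
            second_begin = read_second()
            try:
                return routine(*args, **kwargs)
            finally:
                # Ending values, read just after the routine body ends,
                # including when it exits by raising an exception.
                first_end = read_first()
                second_end = read_second()
                sink.append((routine.__name__,
                             first_end - first_begin,
                             second_end - second_begin))
        return wrapper
    return decorate
```

Applying `@profiled(read_first, read_second, records)` to a routine appends one record per call to `records`, which the caller can then test for a zero second delta.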

9. The method of claim 1 wherein the thread-relative 
performance metric value is accumulated on a thread- 
relative basis during the execution of the application. 

10. The method of claim 1 wherein the thread-relative 
performance metric value is reported as a portion of thread- 
relative performance data. 

11. The method of claim 1 wherein the first performance 
metric delta value, the second performance metric delta 
value, the first performance metric beginning value, the 
second performance metric beginning value, the first per- 
formance metric ending value, and/or the second perfor- 
mance metric ending value are reported as a portion of 
performance data for use in statistical, post-processing after 
the execution of the application. 
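
The accumulation and reporting of claims 9 through 11 can be sketched as follows. The class `ThreadAccumulator` and its methods are hypothetical. Excluding unreliable deltas from the running total is an assumption made for this example; the claims recite only that reliability is determined and that the raw values may be reported for post-processing.

```python
import threading
from collections import defaultdict


class ThreadAccumulator:
    """Accumulate reliable deltas per thread and keep raw records."""

    def __init__(self):
        self._lock = threading.Lock()
        self._totals = defaultdict(int)     # thread id -> sum of reliable deltas
        self._discarded = defaultdict(int)  # thread id -> count of unreliable deltas
        self._raw = []                      # every record, for post-processing

    def record(self, first_delta: int, second_delta: int) -> None:
        tid = threading.get_ident()
        with self._lock:
            self._raw.append((tid, first_delta, second_delta))
            if second_delta == 0:
                self._totals[tid] += first_delta
            else:
                self._discarded[tid] += 1

    def report(self) -> dict:
        """Return thread-relative totals keyed by thread id."""
        with self._lock:
            tids = set(self._totals) | set(self._discarded)
            return {tid: {"total": self._totals[tid],
                          "discarded": self._discarded[tid]}
                    for tid in tids}

    def raw_records(self) -> list:
        """Return a copy of all records, reliable or not."""
        with self._lock:
            return list(self._raw)
```

A lock is used here for simplicity; a lower-overhead variant could keep the totals in thread-local storage and merge them only when reporting.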

12. A data processing system comprising:

first reading means for reading a first counter to get a first 
performance metric beginning value for a first perfor- 
mance metric; 

second reading means for reading a second counter to get
a second performance metric beginning value for a 
second performance metric; 

execution means for executing at least a portion of the 
application; 

third reading means for reading the first counter to get a
first performance metric ending value; 

fourth reading means for reading the second counter to get 
a second performance metric ending value; 

first computing means for computing a first performance 
metric delta value from the first performance metric
beginning value and the first performance metric end- 
ing value; 

second computing means for computing a second perfor- 
mance metric delta value from the second performance 
metric beginning value and the second performance
metric ending value; and 

determining means for determining whether the first per- 
formance metric delta value is reliable as a thread- 
relative performance metric value representing the first 
performance metric during the execution of the portion
of the application based upon whether the second 
performance metric delta value is zero. 

13. The data processing system of claim 12 wherein the 
first counter and the second counter are performance monitor
counters in a processor in the data processing system, and
wherein the first counter and the second counter are read 
directly by application-level code. 

14. The data processing system of claim 12 wherein the 
first counter is a performance monitor counter in a processor
in the data processing system and the second counter is a
kernel-maintained counter, and wherein the first counter and 
the second counter are read directly by application-level 
code. 

15. The data processing system of claim 14 wherein the 
second counter is in a kernel-mapped area of memory.

16. The data processing system of claim 12 wherein the 
first counter and the second counter are kernel-maintained 
counters, and wherein the first counter and the second 
counter are read directly by application-level code from a 
kernel-mapped area of memory.

17. The data processing system of claim 12 wherein the 
second performance metric represents a number of inter- 
rupts. 

18. The data processing system of claim 14 wherein the 
second performance metric represents a number of thread
dispatch events. 

19. The data processing system of claim 12 wherein the 
first performance metric beginning value and the second 
performance metric beginning value are read substantially 
near an invocation of a routine and wherein the first performance
metric ending value and the second performance
metric ending value are read substantially near an end of the 
routine. 

20. The data processing system of claim 12 wherein the 
thread-relative performance metric value is accumulated on
a thread-relative basis during the execution of the applica- 
tion. 

21. The data processing system of claim 12 wherein the
thread-relative performance metric value is reported as a 
portion of thread-relative performance data.

22. The data processing system of claim 12 wherein the 
first performance metric delta value, the second performance 
metric delta value, the first performance metric beginning
value, the second performance metric beginning value, the 
first performance metric ending value, and/or the second 
performance metric ending value are reported as a portion of 
performance data for use in statistical, post-processing after 
the execution of the application. 

23. A computer program product in a computer-readable 
medium for use in a data processing system for monitoring 
the performance of an application executing in the data 
processing system, the computer program product compris- 
ing: 

instructions for reading a first counter to get a first 
performance metric beginning value for a first perfor- 
mance metric; 

instructions for reading a second counter to get a second 
performance metric beginning value for a second per- 
formance metric; 

instructions for executing at least a portion of the appli- 
cation; 

instructions for reading the first counter to get a first 
performance metric ending value; 

instructions for reading the second counter to get a second 
performance metric ending value; 

instructions for computing a first performance metric 
delta value from the first performance metric beginning 
value and the first performance metric ending value; 

instructions for computing a second performance metric 
delta value from the second performance metric begin- 
ning value and the second performance metric ending 
value; and 

instructions for determining whether the first performance 
metric delta value is reliable as a thread-relative per- 
formance metric value representing the first perfor- 
mance metric during the execution of the portion of the 
application based upon whether the second perfor- 
mance metric delta value is zero. 

24. The computer program product of claim 23 wherein 
the first counter and the second counter are performance 
monitor counters in a processor in the data processing 
system, and wherein the first counter and the second counter 
are read directly by application-level code. 

25. The computer program product of claim 23 wherein 
the first counter is a performance monitor counter in a 
processor in the data processing system and the second 
counter is a kernel-maintained counter, and wherein the first 
counter and the second counter are read directly by 
application-level code. 

26. The computer program product of claim 25 wherein 
the second counter is in a kernel-mapped area of memory. 

27. The computer program product of claim 23 wherein 
the first counter and the second counter are kernel- 
maintained counters, and wherein the first counter and the 
second counter are read directly by application-level code 
from a kernel-mapped area of memory. 

28. The computer program product of claim 23 wherein 
the second performance metric represents a number of 
interrupts. 

29. The computer program product of claim 25 wherein 
the second performance metric represents a number of 
thread dispatch events. 

30. The computer program product of claim 23 wherein 
the first performance metric beginning value and the second 
performance metric beginning value are read substantially 
near an invocation of a routine and wherein the first perfor- 
mance metric ending value and the second performance 
metric ending value are read substantially near an end of the
routine. 

31. The computer program product of claim 23 wherein 
the thread-relative performance metric value is accumulated 
on a thread-relative basis during the execution of the
application.

32. The computer program product of claim 23 wherein 
the thread-relative performance metric value is reported as a 
portion of thread-relative performance data. 

33. The computer program product of claim 23 wherein 
the first performance metric delta value, the second perfor- 
mance metric delta value, the first performance metric 
beginning value, the second performance metric beginning 
value, the first performance metric ending value, and/or the 
second performance metric ending value are reported as a 
portion of performance data for use in statistical, post- 
processing after the execution of the application. 